Model Evaluator

Installs: 44
Rank: #16810

Install

npx skills add https://github.com/eddiebe147/claude-settings --skill 'Model Evaluator'
Model Evaluator
The Model Evaluator skill helps you rigorously assess and compare machine learning model performance across multiple dimensions. It guides you through selecting appropriate metrics, designing evaluation protocols, avoiding common statistical pitfalls, and making data-driven decisions about model selection.
Proper model evaluation goes beyond accuracy scores. This skill covers evaluation across the full spectrum: predictive performance, computational efficiency, robustness, fairness, calibration, and production readiness. It helps you answer not just "which model is best?" but "which model is best for my specific use case and constraints?"
Whether you are comparing LLMs, classifiers, or custom models, this skill ensures your evaluation methodology is sound and your conclusions are reliable.
Core Workflows
Workflow 1: Design Evaluation Protocol
1. Define evaluation objectives:
   - Primary goal (accuracy, speed, cost, etc.)
   - Secondary constraints
   - Failure modes to test
   - Real-world conditions to simulate
2. Select appropriate metrics:

| Task Type | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Classification | Accuracy, F1, AUC-ROC | Precision, Recall, Confusion Matrix |
| Regression | RMSE, MAE, R² | Residual analysis, prediction intervals |
| Ranking | NDCG, MRR, MAP | Precision@k, Recall@k |
| Generation | BLEU, ROUGE, BERTScore | Human eval, faithfulness |
| LLM | Task-specific accuracy | Latency, cost, consistency |
3. Design test sets:
   - Held-out test data
   - Edge case collections
   - Adversarial examples
   - Distribution shift tests
4. Plan statistical methodology (a significance-testing sketch follows this workflow):
   - Sample sizes for significance
   - Confidence intervals
   - Multiple comparison corrections
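To make the statistical step concrete, here is a minimal sketch of a paired bootstrap significance test, assuming you have per-example 0/1 correctness arrays for two models on the same test set; when running k pairwise comparisons, a simple Bonferroni correction is to test at alpha / k. The function name and toy data are illustrative, not part of this skill:

```python
import numpy as np

def paired_bootstrap_test(correct_a, correct_b, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap p-value for a difference in mean accuracy."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(correct_a, float) - np.asarray(correct_b, float)
    # Resample per-example differences with replacement and check how often
    # the resampled mean lands on either side of zero.
    boot = rng.choice(diffs, size=(n_boot, len(diffs)), replace=True).mean(axis=1)
    p_value = 2 * min((boot <= 0).mean(), (boot >= 0).mean())
    return diffs.mean(), min(p_value, 1.0)

# Toy example: model A is correct slightly more often than model B.
rng = np.random.default_rng(1)
a = rng.random(500) < 0.82
b = rng.random(500) < 0.78
print(paired_bootstrap_test(a, b))
```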
Workflow 2: Execute Comparative Evaluation
1. Prepare evaluation infrastructure:

```python
class ModelEvaluator:
    """Runs a shared metric suite over one test set for every model."""

    def __init__(self, test_data, metrics):
        self.test_data = test_data
        self.metrics = metrics
        self.results = {}

    def evaluate(self, model, model_name):
        # Score one model and cache its results under model_name.
        predictions = model.predict(self.test_data.inputs)
        scores = {}
        for metric in self.metrics:
            scores[metric.name] = metric.compute(predictions, self.test_data.labels)
        self.results[model_name] = scores
        return scores

    def compare(self):
        # statistical_comparison is assumed to be defined elsewhere.
        return statistical_comparison(self.results)
```
2. Run evaluations consistently across models
3. Compute confidence intervals
4. Test for statistical significance
5. Generate comparison report (a usage sketch for the evaluator follows this list)
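A hypothetical usage sketch for the class above; the `Accuracy` metric and the toy baseline model are placeholders that follow the `.name`/`.compute` and `.predict` interfaces the evaluator assumes. `compare()` is omitted here because `statistical_comparison` is left undefined above:

```python
from types import SimpleNamespace

class Accuracy:
    """Toy metric exposing the .name / .compute interface the evaluator expects."""
    name = "accuracy"

    @staticmethod
    def compute(predictions, labels):
        return sum(p == t for p, t in zip(predictions, labels)) / len(labels)

class MajorityModel:
    """Toy baseline that always predicts class 1."""
    def predict(self, inputs):
        return [1 for _ in inputs]

test_data = SimpleNamespace(inputs=[10, 20, 30, 40], labels=[1, 0, 1, 1])
evaluator = ModelEvaluator(test_data, metrics=[Accuracy()])
print(evaluator.evaluate(MajorityModel(), "majority-baseline"))  # {'accuracy': 0.75}
```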
Workflow 3: LLM-Specific Evaluation
1. Define evaluation dimensions:
   - Task accuracy (factual, reasoning, coding)
   - Response quality (coherence, relevance, style)
   - Safety and alignment
   - Efficiency (tokens, latency, cost)
2. Create evaluation datasets:
   - Representative prompts
   - Ground truth answers (where applicable)
   - Human preference data
3. Implement LLM evaluation (an exact-match sketch follows this workflow):
   - Automated metrics (exact match, semantic similarity)
   - LLM-as-judge evaluations
   - Human evaluation protocols
4. Analyze results across dimensions
5. Make recommendations with tradeoffs
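For the automated-metrics step, a minimal exact-match sketch using SQuAD-style answer normalization (lowercase, strip punctuation and articles); this is one common scheme, and semantic-similarity or judge-based metrics would layer on top of it:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
```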
Quick Reference
| Action | Command/Trigger |
| --- | --- |
| Design evaluation | "How should I evaluate [model type]" |
| Choose metrics | "What metrics for [task type]" |
| Compare models | "Compare these models: [list]" |
| LLM evaluation | "Evaluate LLM performance" |
| Statistical testing | "Is this difference significant" |
| Bias evaluation | "Check model for bias" |
Best Practices
Use Multiple Metrics
No single metric tells the whole story:
- Include both aggregate and granular metrics
- Report confidence intervals, not just point estimates
- Show performance across subgroups
Test on Realistic Data
Evaluation data should match production:
- Same distribution as real inputs
- Include edge cases and hard examples
- Test on data the model hasn't seen (see the split sketch below)
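As a concrete starting point, a minimal held-out split sketch using scikit-learn; the synthetic `X`/`y` arrays and the norm-based edge-case filter are placeholder assumptions, not part of this skill:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in your real features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] > 0).astype(int)

# Stratified held-out split keeps class proportions intact.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Track edge cases as a separate suite so they are reported on their own;
# this norm-based filter is only a stand-in for a domain-specific rule.
edge_mask = np.linalg.norm(X_test, axis=1) > 4.0
X_edge, y_edge = X_test[edge_mask], y_test[edge_mask]
```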
Account for Variance
Models and data have randomness:
- Run multiple seeds for training-based evaluations
- Bootstrap confidence intervals (see the sketch below)
- Use proper statistical tests for comparison
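A percentile-bootstrap confidence interval sketch, assuming a per-example score (e.g., 0/1 correctness) for each test item; it complements the paired significance test shown in Workflow 1:

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a per-example metric."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    boot = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

mean, (lo, hi) = bootstrap_ci([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
print(f"accuracy = {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")  # wide CI: only 10 examples
```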
Consider All Costs
Performance isn't just accuracy (a latency-measurement sketch follows this list):
- Inference latency and throughput
- Memory and compute requirements
- API costs for hosted models
- Maintenance and update burden
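A simple wall-clock latency sketch; `predict_fn` is a placeholder for whatever callable fronts your model, and per-request timing like this ignores batching effects:

```python
import time
import numpy as np

def measure_latency_ms(predict_fn, inputs, warmup=5):
    """Per-request wall-clock latency in milliseconds (p50 and p99)."""
    for x in inputs[:warmup]:      # warm caches / lazy initialization first
        predict_fn(x)
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict_fn(x)
        timings.append((time.perf_counter() - start) * 1000)
    return np.percentile(timings, 50), np.percentile(timings, 99)

# Toy example with a stand-in "model".
p50, p99 = measure_latency_ms(lambda x: sum(range(x)), inputs=[10_000] * 200)
print(f"p50 = {p50:.3f} ms, p99 = {p99:.3f} ms")
```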
Test Robustness
How does the model handle adversity? (A noise-sweep sketch follows this list.)
- Input perturbations and noise
- Distribution shift
- Adversarial examples
- Missing or malformed inputs
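For input perturbations on numeric features, a minimal noise-robustness sweep; `predict_fn`, `X`, and `y` are placeholders for your model and labeled test set:

```python
import numpy as np

def noise_robustness(predict_fn, X, y, sigmas=(0.0, 0.1, 0.5, 1.0), seed=0):
    """Accuracy as Gaussian noise of increasing scale is added to inputs."""
    rng = np.random.default_rng(seed)
    return {
        sigma: float((predict_fn(X + rng.normal(scale=sigma, size=X.shape)) == y).mean())
        for sigma in sigmas
    }

# Toy example: a threshold "model" on synthetic data degrades as noise grows.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] > 0).astype(int)
print(noise_robustness(lambda X_: (X_[:, 0] > 0).astype(int), X, y))
```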
Evaluate Fairly
Ensure fair comparison across models:
- Same test data for all models
- Consistent preprocessing
- Equivalent hyperparameter tuning effort
- Document any advantages/disadvantages

Advanced Techniques
Multi-Dimensional Evaluation
Score models across multiple axes:

```python
def multi_dim_evaluate(model, test_data):
    # The helpers below (compute_accuracy, measure_latency, ...) stand in
    # for whatever measurement functions your project defines.
    return {
        "accuracy": compute_accuracy(model, test_data),
        "latency_p50": measure_latency(model, test_data, percentile=50),
        "latency_p99": measure_latency(model, test_data, percentile=99),
        "memory_mb": measure_memory(model),
        "cost_per_1k": compute_cost(model, n=1000),
        "robustness": adversarial_accuracy(model, test_data),
        "fairness": demographic_parity(model, test_data),
    }
```

LLM-as-Judge Protocol
Use LLMs to evaluate LLM outputs. Prompt template:

```
Rate the following response on a scale of 1-5 for:
- Accuracy: Is the information correct?
- Helpfulness: Does it address the user's need?
- Clarity: Is it easy to understand?

Question: {question}
Response: {response}
Ground truth (if available): {ground_truth}

Provide scores and brief justification.
```

A/B Testing Framework
For production evaluation:

```python
import random  # needed for the traffic split below

class ABTest:
    def __init__(self, model_a, model_b, traffic_split=0.5):
        self.models = {"A": model_a, "B": model_b}
        self.split = traffic_split  # probability of routing to variant A
        self.results = {"A": [], "B": []}

    def serve(self, request):
        variant = "A" if random.random() < self.split else "B"
        response = self.models[variant].predict(request)
        return response, variant

    def record_outcome(self, variant, success):
        self.results[variant].append(success)

    def compute_significance(self):
        # statistical_test is assumed to be defined elsewhere.
        return statistical_test(self.results["A"], self.results["B"])
```

Calibration Analysis
Ensure predicted probabilities are meaningful (a minimal ECE sketch follows the pitfalls list below):
- Expected Calibration Error (ECE)
- Reliability diagrams
- Brier score decomposition
- Temperature scaling for recalibration

Common Pitfalls to Avoid
- Overfitting to the test set through repeated evaluation
- Ignoring statistical significance in model comparisons
- Using inappropriate metrics for the task (e.g., accuracy for imbalanced classes)
- Evaluating on data too similar to training data
- Ignoring computational costs in model selection
- Not testing robustness to distribution shift
- Conflating correlation with causation in A/B tests
- Failing to account for multiple comparisons in statistical tests
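To make the calibration analysis above concrete, a minimal sketch of binned Expected Calibration Error, assuming arrays of predicted confidences and 0/1 correctness; the toy data illustrates an overconfident model:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |accuracy - mean confidence| per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# An overconfident toy model: 90% stated confidence, ~70% actual accuracy.
conf = np.full(1000, 0.9)
hits = np.random.default_rng(0).random(1000) < 0.7
print(f"ECE = {expected_calibration_error(conf, hits):.3f}")  # roughly 0.2
```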